Skip to content

Pbs docker ci#47

Merged
jhamman merged 8 commits into
dask:masterfrom
guillaumeeb:pbs_docker_ci
May 14, 2018
Merged

Pbs docker ci#47
jhamman merged 8 commits into
dask:masterfrom
guillaumeeb:pbs_docker_ci

Conversation

@guillaumeeb

Copy link
Copy Markdown
Member

Bringing this work to the attention of everyone. Should close #41.

This is not fully working yet, but I need to begin integration with Travis, and maybe have some feedback.
Locally on my computer, test_basic is passing, but its teardown fails, and I don't really know what's happening. We'll see in Travis.

@jhamman

jhamman commented Apr 27, 2018

Copy link
Copy Markdown
Member

Nice to see that you're making progress here. It doesn't look like travis is picking up the pbs environment in the matrix though...

Also, can you remove the following files:

dask-worker-space/global.lock
dask-worker-space/purge.lock
dask-worker-space/worker-t461kxij.dirlock

WIP: adding CI with a dockerized PBS cluster

almost there

Working pbs docker cluster, fix was to add user on slaves

Test are almost working, may need feedback

Adding new job in Travis.

removing unused files
@guillaumeeb

Copy link
Copy Markdown
Member Author

Thanks for the comment, I've cleaned up commits and removed the file.
Travis is triggered this time, we'll see how it goes.

@mrocklin

Copy link
Copy Markdown
Member

For this failure:

    def close(self, all_fds=False):
        self.closing = True
        for fd in list(self.handlers):
            fileobj, handler_func = self.handlers[fd]
            self.remove_handler(fd)
            if all_fds:
                self.close_fd(fileobj)
        self.asyncio_loop.close()
>       del IOLoop._ioloop_for_asyncio[self.asyncio_loop]
E       KeyError: <_UnixSelectorEventLoop running=False closed=True debug=False>

I recommend using distributed master if possible

@guillaumeeb

Copy link
Copy Markdown
Member Author

So test_adaptive is almost working on my computer: it only fails because the cluster.jobs object is not cleaned up. Otherwise, cluster is automatically scaling up and down as expected.
This is not the case on Travis though, don't know why yet.

If I manage to make it work on Travis as on my laptop, I propose to just comment the part testing if cluster.jobs is empty for now, so that adaptive cluster fix is tackled in another pull request.

Any opinion on that?

@mrocklin

Copy link
Copy Markdown
Member

Removing that part of the test seems fine to me.

@lesteve

lesteve commented May 2, 2018

Copy link
Copy Markdown
Member

If I manage to make it work on Travis as on my laptop, I propose to just comment the part testing if cluster.jobs is empty for now, so that adaptive cluster fix is tackled in another pull request.

Great to see progress on having CIs for PBS! I agree that adaptive tests should be fixed in a separate PR. Last time I looked the problems I found were summed up in #27 (comment).

@guillaumeeb

Copy link
Copy Markdown
Member Author

Thanks for the feedback.
I think I've reproduced the problem from Travis on my laptop, and identified its source: PBS does not manage to really finish the jobs because it cannot copy its stderr and stdout files on the master. Thus it cannot run the job from the adaptive test as the two jobs from the previous one did not release the reserved resources.
I will try to work on it on my free time.

@lesteve lesteve left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments from a quick glance

Comment thread ci/pbs.sh Outdated
# start pbs cluster
cd ./ci/pbs
docker-compose up -d
while [ `docker exec -it -u pbsuser pbs_master pbsnodes -a | grep "Mom = pbs_slave" | wc -l` -ne 2 ]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this in start-pbs.sh similarly to what is done in ci/sge/start-sge.sh. IIRC this is useful to run the tests locally.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, locally I just use the commands

cd ci/pbs
docker-compose up -d

I can then see when the cluster is started. I feel there is no need of a script to do this.

I would prefer to avoid having too many scripts, I feel that it is easier to understand this way. When somebody new looks at the Travis build config file, he does'nt have to go through 2 other ones just for the before_install phase.

If we want to make things easier to test locally, may be we should add a script which call in order all the jobqueue_ * from pbs.sh, or better, any jobquepue.sh file.

@lesteve lesteve May 3, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to avoid having too many scripts, I feel that it is easier to understand this way. When somebody new looks at the Travis build config file, he does'nt have to go through 2 other ones just for the before_install phase.

I totally get your point that using scripts adds another level of indirection. I think we are coming at it from a different point of view: I have looked at quite a few different .travis.yml files and to me the dask-jobqueue is nicely layed out and straightforward to follow. What I am a newbie about though is the docker side and I feel this may be the case for other potential contributers/users of dask-jobqueue. Being able to just run a bash script and not have to know too much about docker is a great advantage.

On top of the docker-compose up -d, my understanding is that the while loop is to check that the cluster is correctly set-up. IIRC when I was working on the SGE CI, there could be a delay of more than 1 minute before you exit the while loop and it's nice to have something at the end confirming that everything went fine.

Also I think having consistency between the different schedulers CI setup is very important.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, you got me with the last consistency argument.

Comment thread ci/none.sh

function jobqueue_after_success {
echo "Hurrah"
function jobqueue_after_script {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me the difference between after_success and after_script, is that that after_script can still fail the build but not after_success?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference as I understand it is that after_success is run only if the script succeeded, and after_script is run no matter if the script succeeded or failed.
I don't believe any of those can fail the build.
I needed after_script to debug the failures.

Comment thread ci/pbs/run-master.sh Outdated
qmgr -c "create node pbs_slave_2"

# wait until the end of tests
/bin/sleep 3600

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the sleep? Could we not do the same as in SGE i.e. start a Python HTTP server? To be honest, I am not sure why we do that either ...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two are working. The idea is to have a blocking process so as not to exit the docker run command, and have the container always up during the test.
I just found it weird to start a Python HTTP server just to lock the launching script in a process, but this is a perfectly working solution, which would last longer than mine 🙂 .

Idealy, we should block on one of the PBS process, or SGE in the other case.

@lesteve lesteve May 3, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found it weird to start a Python HTTP server just to lock the launching script in a process, but this is a perfectly working solution, which would last longer than mine

OK this may show my sheer ignorance of docker-related things, but my worry is what happens if I start the cluster locally on my laptop, forget about it and come back the next day, will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'

Nope, you won't! Cluster will shut down after 1 hour as you expected.

I get your point here too. I don't really like the idea of starting a fake python HTTP server just to have a hanging process, but I don't really have time to look for other solutions, so lets do it this way 😁 .

Comment thread ci/pbs/docker-compose.yml Outdated
@@ -0,0 +1,42 @@
version: "3"

@lesteve lesteve May 3, 2018

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to use version: "2" with my version of docker-compose (1.8.0) (I followed the docker installation doc on a Ubuntu 16.04 box). Is there a reason you are using version: "3" (newbie question again ...)?

@guillaumeeb guillaumeeb May 3, 2018

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I need version 3 here, which is more for using docker stack deploy or things like that IIRC. I will update that.

I find it weird that you installed such an old version of docker-compose though, following official doc should install 1.21 currently: https://docs.docker.com/compose/install/.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I only followed the install instructions for docker and thought that it would do docker-compose too ... I am guessing this is the docker-compose from apt on Ubuntu 16.04 ...

@guillaumeeb

Copy link
Copy Markdown
Member Author

I believe this PR is ready to be merged, does anyone have some comment?

@jhamman jhamman left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this one!

@guillaumeeb guillaumeeb changed the title [WIP] Pbs docker ci Pbs docker ci May 7, 2018
@jhamman jhamman merged commit 8d243b6 into dask:master May 14, 2018
@guillaumeeb guillaumeeb deleted the pbs_docker_ci branch August 27, 2018 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI for PBSPro using Docker

5 participants